(57) ABSTRACT

A flexible framework for synchronization of multimedia 
streams synchronizes the incoming streams on the basis of 
the collaboration of a transmitter-driven and a local inter-media
synchronization module. Whenever the first one is
not enough to ensure reliable synchronization or cannot 
assure synchronization because the encoder does not know 
the exact timing of the decoder, the second one comes into 
play. Normally, the transmitter-driven module uses the 
stream time stamps if their drift is acceptable. If the drift is 
too high, the system activates an internal inter-media synchronization
mode while the transmitter-driven module
extracts the coarsest inter-media synchronization and/or the 
structural information present in the streams. The internal 
clock of the receiver is used as absolute time reference. 
Whenever the drift value stabilizes to acceptable values, the 
system switches back smoothly to the external synchroni- 
zation mode. The switch has a given hysteresis in order to 
avoid oscillations between internal and external synchroni- 
zation modes. 

34 Claims, 2 Drawing Sheets 
FLEXIBLE SYNCHRONIZATION
FRAMEWORK FOR MULTIMEDIA 
STREAMS HAVING INSERTED TIME STAMP 

BACKGROUND OF THE INVENTION

The present invention relates generally to methods and 
systems for synchronizing data, and more particularly to a 
method and system for synchronizing multimedia streams. 

A common approach in multimedia inter-media and intra-media
synchronization consists of introducing time-stamps
in the media stream, which time stamps carry information
relative to an absolute clock reference and relative to events
to be synchronized. The Program Clock Reference (PCR),
Decoding Time Stamps (DTS) and Presentation Time
Stamps (PTS) of MPEG-2 are an example of such an
approach. The receiver decodes the time-stamps and
synchronizes the streams accordingly.

In the case of small amounts of jitter due to transmission or
processing delays, mechanisms of clock recovery and jitter
compensation, such as Phase Locked Loops (PLLs), are
employed at the receiver end. Normally, these mechanisms
work satisfactorily in the case of small amounts of jitter.

The disadvantages of such approaches, however, are the
lack of flexibility and adaptability. If the jitter of one of the
time stamps is too high, the receiver does not have the means
to compensate for it. In this case, synchronization is lost forever
for the event to be synchronized by this time stamp.
Furthermore, if the encoder does not know the exact timing
of the decoder, the encoder cannot specify all the time-stamps
and the method is not applicable.

The present invention is therefore directed to the problem
of developing a method and system for synchronizing multimedia
streams that is highly reliable despite the presence of
large amounts of jitter, and that operates independently and
without a priori knowledge of the timing of the decoder
when encoding the data.

SUMMARY OF THE INVENTION 

The present invention solves this problem by switching
between a slave, transmitter-driven synchronization mode
and a local (i.e., receiver-driven) synchronization mode
whenever the temporal references arriving from the transmitter
become unreliable. In the alternative, the present
invention employs both slave and local synchronization
modes to solve this problem. The present invention is
particularly effective in the synchronization of variable rate
data streams such as text and facial animation parameters,
but is also applicable to other data streams, such as video and
audio streams.

According to the present invention, a method for synchronizing
multimedia streams comprises the steps of using
a transmitter-driven synchronization technique which relies
upon a plurality of temporal references (or time stamps)
inserted in the multimedia streams at an encoding end, using
an internal inter-media synchronization technique at a
decoding end if a performance measurement value of at least
one of the plurality of temporal references exceeds some
predetermined threshold, and extracting a coarsest inter-media
synchronization and/or structural information present
in the multimedia stream using the transmitter-driven technique
and inputting the coarsest inter-media synchronization
and/or structural information to a controller employing the
internal inter-media synchronization technique.

According to the present invention, it is particularly
advantageous if the above method switches back to the
transmitter-driven synchronization technique whenever the
performance measurement value stabilizes to an acceptable
value.

Another particularly advantageous embodiment of the
present invention occurs when the method provides a predetermined
hysteresis when switching from one synchronization
technique to the other to avoid oscillations between
the two synchronization techniques.

According to the present invention, a system for synchro- 
nizing a multimedia stream includes a transmitter-driven 
synchronization controller, an internal inter-media synchro- 
nization controller and a processor. The transmitter-driven 
synchronization controller includes a control input, synchro- 
nizes the multimedia stream based on a plurality of time 
stamps in the multimedia stream inserted at an encoding end 
and extracts a coarsest inter-media synchronization and/or 
structural information present in the multimedia stream. The 
internal inter-media synchronization controller also includes 
a control input, is coupled to the transmitter-driven synchro- 
nization controller and receives the coarsest inter-media 
synchronization and/or structural information present in the 
multimedia stream. The processor receives the plurality of 
time stamps, is coupled to the control input of the 
transmitter-driven synchronization controller and the control 
input of the internal inter-media synchronization controller 
and activates the internal inter-media synchronization con- 
troller if a performance measurement value exceeds some 
predetermined threshold. 

One particularly advantageous aspect of the present 
invention is that the processor activates the transmitter- 
driven synchronization controller whenever the performance 
measurement value stabilizes to an acceptable level. 

It is also particularly advantageous when the processor 
includes predetermined hysteresis to avoid oscillations in 
switching between the internal inter-media synchronization 
controller and the transmitter-driven synchronization con- 
troller. 

Further, according to the present invention, an apparatus 
for synchronizing a facial animation parameter stream and a 
text stream in an encoded animation includes a 
demultiplexer, a transmitter-based synchronization 
controller, a local synchronization controller, a text-to- 
speech converter, a phoneme-to-video converter and a 
switch. The demultiplexer receives the encoded animation, 
and outputs a text stream and a facial animation parameter 
stream, wherein the text stream includes a plurality of codes 
indicating a synchronization relationship with a plurality of 
mimics in the facial animation parameter stream and the text 
in the text stream. The transmitter-based synchronization 
controller is coupled to the demultiplexer, includes a control 
input, controls the synchronization of the facial animation 
parameter stream and the text stream based on the plurality 
of codes placed in the text stream during an encoding 
process, and outputs the plurality of codes. The local syn- 
chronization controller is coupled to the transmitter-based 
synchronization controller, and includes a control input. The 
text-to-speech converter is coupled to the demultiplexer, 
converts the text stream to speech, outputs a plurality of 
phonemes, and outputs a plurality of real-time time stamps 
and the plurality of codes in a one-to-one correspondence, 
whereby the plurality of real-time time stamps and the 
plurality of codes indicate a synchronization relationship 
between the plurality of mimics and the plurality of phonemes.
The phoneme-to-video converter is coupled to the
text-to-speech converter, and synchronizes a plurality of
facial mimics with the plurality of phonemes based on the
plurality of real-time time stamps and the plurality of codes.
The switch is coupled to the transmitter-based synchronization
controller and the local synchronization controller, and
switches between the transmitter-based synchronization
controller and the local synchronization controller based on
a predetermined performance measurement of the plurality
of codes.

Another aspect of the present invention includes a synchronization
controller for synchronizing multimedia
streams without a priori knowledge of an exact timing of a
decoder in an encoder. According to the present invention,
the synchronization controller operates in three possible
modes: an encoder based synchronization mode, a switching
synchronization mode and a cooperating synchronization mode.
In this instance, the switching synchronization mode normally uses
the encoder based synchronization technique, but switches
to a decoder based synchronization technique whenever the
encoder based synchronization technique becomes unreliable.
The cooperating synchronization mode uses an
encoder based synchronization technique to provide a first
level of synchronization, and uses a decoder based synchronization
technique to provide a second level of synchronization.

Finally, the present invention is applicable to the synchronization
of any multimedia streams, such as text-to-speech
data, facial animation parameter data, video data, audio data,
and rendering.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 depicts a block diagram of the system of the
present invention.

FIG. 2 depicts the environment in which the present
invention operates.

FIG. 3 depicts the architecture of an MPEG-4 decoder
using text-to-speech conversion, in which the present invention
is used to synchronize a text stream and a facial
animation parameter stream.

DETAILED DESCRIPTION 

The present invention provides a framework for flexible
synchronization of multimedia streams. Its main feature is
its capability of switching between a slave, transmitter-driven
synchronization mode and a receiver-driven or local
synchronization mode whenever the temporal references
arriving from the transmitter are unreliable. Alternatively,
the present invention provides the capability of using both
the slave and local synchronization modes. Consequently,
the present invention can operate in three different modes:
(1) a transmitter based synchronization mode; (2) a switching
mode in which a transmitter based synchronization
technique is used unless the transmitter based synchronization
becomes unreliable, in which case the system switches
to a receiver based synchronization technique; and (3) a
cooperating mode in which a transmitter based synchronization
technique provides a first (or coarse) level of synchronization
and a receiver based synchronization technique
provides a second (or finer) level of synchronization. The
present invention is particularly effective in synchronizing
variable rate streams, such as text and facial animation
parameters, as well as video and audio, or rendering data.

The present invention synchronizes the incoming streams 
on the basis of the collaboration of a transmitter-driven and 
a local inter-media synchronization module. Whenever the 
first one is not enough to ensure reliable synchronization or 
cannot ensure synchronization because the encoder does not
know the exact timing of the decoder, the second one comes 
into play. 



In a typical scenario, the transmitter-driven module uses 
the data stream time stamps if their drift is acceptable. If the 
drift of the time stamps is too high (e.g., above 80 
milliseconds), the system activates an internal inter-media 
synchronization mode, while the transmitter-driven module 
extracts the coarsest inter-media synchronization and/or the
structural information present in the streams. (While 80 
milliseconds is used as an example of a threshold for setting 
the switching from one synchronization mode to the other, 
other values will suffice, depending upon the exact applica- 
tion and the underlying data, as well as other variables. 
Furthermore, 80 milliseconds should be interpreted as 
merely an approximation, i.e., as some value on that order of 
magnitude.) The internal clock of the receiver is used as 
absolute time reference. Whenever the drift value stabilizes 
within acceptable values (e.g., less than 80 milliseconds), 
the system switches back smoothly to the external synchro- 
nization mode. The switch uses a given hysteresis to avoid 
oscillations between the internal and external synchroniza- 
tion modes. A simple block diagram is shown in FIG. 1. 
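
The switching behavior described above can be sketched as a two-threshold comparator. This is only an illustrative sketch: the 80 millisecond upper threshold is the example value given in the text, while the lower threshold of 60 milliseconds, the class name `SyncModeSwitch` and the mode labels are assumptions introduced here. The text also speaks of the drift "stabilizing"; the sketch simplifies this to a single sample falling below the lower threshold.

```python
from dataclasses import dataclass

EXTERNAL = "external"   # transmitter-driven synchronization mode
INTERNAL = "internal"   # local (receiver-driven) synchronization mode


@dataclass
class SyncModeSwitch:
    """Selects a synchronization mode from the measured time-stamp drift."""
    high_ms: float = 80.0   # example threshold from the text
    low_ms: float = 60.0    # assumed lower threshold forming the hysteresis band
    mode: str = EXTERNAL

    def update(self, drift_ms: float) -> str:
        drift = abs(drift_ms)
        if self.mode == EXTERNAL and drift > self.high_ms:
            # Drift too high: fall back to the receiver's own clock.
            self.mode = INTERNAL
        elif self.mode == INTERNAL and drift < self.low_ms:
            # Drift is acceptable again: return to external synchronization.
            self.mode = EXTERNAL
        return self.mode


switch = SyncModeSwitch()
modes = [switch.update(d) for d in (20, 85, 75, 70, 55, 79)]
print(modes)
```

Because the two thresholds differ, drift values that hover around 80 milliseconds (75, 70 and 79 in the example) do not cause the mode to toggle on every sample.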

Referring to FIG. 1, a switch 11 that determines which of
the two synchronization modules 13, 15 will be used 
receives the time stamps from the encoder (not shown). The 
time stamps are then provided by the switch 11 to both of the 
synchronization modules 13, 15, if both are being used, or 
to the one module that is being used for synchronization at 
the time. In one possible embodiment of the present 
invention, the switch 11 decides which of the two modules 
to use depending upon the amount of jitter present in a 
particular time stamp. In this case, the switch is an intelligent 
switch, such as a microprocessor or software program. In 
another possible embodiment of the present invention, the 
transmitter-based synchronization controller decides which 
of the two modes to use and activates the switch. In this case, 
the switch need not be an intelligent switch, as all of the 
processing is included in the transmitter-based synchroni- 
zation controller. In either case, if the jitter exceeds 80 
milliseconds, for example, then the switch 11 switches to 
local synchronization control 15. While 80 milliseconds is 
used, other values for this decision criterion can be employed
without departing from the present invention. 

Transmitter-Based Synchronization Control 

Transmitter-based synchronization consists of introduc- 
ing temporal references or time-stamps in the media stream, 
which temporal references carry information relative to an 
absolute clock reference and relative to events to be syn- 
chronized. This is a known technique, and by itself does not 
form part of the present invention, hence need not be 
described in further detail herein. The Program Clock Ref- 
erence (PCR) and Decoding Time Stamps (DTS) and Pre- 
sentation Time Stamps (PTS) of MPEG-2 are an example of 
such an approach. The receiver decodes the time-stamps and 
synchronizes the streams accordingly. 

Local Synchronization Control 

According to the present invention, local synchronization 
control 15 utilizes the encoder time stamps included in the 
text-to-speech text stream, which encoder time stamps are 
output by the transmitter-based synchronization control 13, 
as shown in FIG. 1. The local synchronization control 15 
reads the encoder time stamp, generates a real-time time 
stamp for the facial animation stream (or other data stream) 
and associates the correct facial animation parameter with 
the real-time time stamp using the encoder time stamp as a
reference. This enables the local synchronization control 15
to provide additional synchronization beyond that provided
by the transmitter-based synchronization control 13.
Another way of looking at the present invention is that the
transmitter-based synchronization control 13 provides
coarse synchronization, while the receiver-based or local
synchronization control 15 provides finer synchronization. If
the synchronization is acceptable without applying the local
synchronization, then the switch 11 never activates the local
synchronization. Alternatively, one could operate both synchronization
controllers 13, 15 without determining whether
to activate the fine synchronization, which would only result
in more processing than otherwise is necessary.

Text and FAP Synchronization in MPEG-4 

Currently, the MPEG-4 Systems Verification Model (VM)
assumes that synchronization between the text input and the
Facial Animation Parameters (FAP) input stream is
obtained by means of time stamps (see FIG. 2). However,
the encoder does not know the timing of the text-to-speech
(TTS) converter in the decoder.
Hence, the encoder cannot specify the alignment between
synthesized words and the facial animation. Furthermore,
timing varies between different TTS systems.

Another aspect of the present invention is as follows.
Encoder time stamps (ETS) are included in the text string
transmitted to the TTS and in the FAP stream transmitted to
the receiver. The transmitter-driven synchronization control
reads the ETS and passes it to the local inter-media synchronization
module, which generates a real-time time
stamp (RTS) for the facial animation system and associates
the correct FAP with the real-time time stamp using the
encoder time stamp of the bookmark as a reference. Similarly,
a decoder time stamp (DTS) is generated for the
decoder system. The transmitter-driven synchronization
control checks that the ETS is met with a maximum drift of
80 milliseconds, for example.

Referring to FIG. 3, according to the present invention, 
the synchronization of the decoder system can be achieved 
by using local synchronization by means of event buffers at 
the input of the FA/AP/MP 4 and the audio decoder 10.

A maximum drift of 80 msec between the encoder time 
stamp (ETS) in the text and the ETS in the Facial Animation 
Parameter (FAP) stream is tolerable. 
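
One way to check this tolerance is sketched below. The text does not specify how the drift is measured; the sketch assumes it is the difference between the receiver clock readings at which the same ETS is reached in the text stream and in the FAP stream. The class `DriftMonitor` and its method names are hypothetical, and the wrap-around of a 16-bit ETS is not handled.

```python
from typing import Dict, Optional

MAX_DRIFT_S = 0.080  # example tolerance from the text (80 msec)


class DriftMonitor:
    """Records when each ETS is reached in the text and FAP streams."""

    def __init__(self) -> None:
        self.text_seen: Dict[int, float] = {}  # ETS -> receiver time (text stream)
        self.fap_seen: Dict[int, float] = {}   # ETS -> receiver time (FAP stream)

    def _drift(self, ets: int) -> Optional[float]:
        # Drift is only defined once the ETS has been seen in both streams.
        if ets in self.text_seen and ets in self.fap_seen:
            return abs(self.text_seen[ets] - self.fap_seen[ets])
        return None

    def on_text(self, ets: int, now: float) -> Optional[float]:
        self.text_seen[ets] = now
        return self._drift(ets)

    def on_fap(self, ets: int, now: float) -> Optional[float]:
        self.fap_seen[ets] = now
        return self._drift(ets)


monitor = DriftMonitor()
first = monitor.on_text(17, now=1.000)   # only one stream seen so far
drift = monitor.on_fap(17, now=1.050)    # both streams seen
within_tolerance = drift is not None and drift <= MAX_DRIFT_S
print(first, within_tolerance)
```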

One embodiment for the syntax of the bookmarks when
placed in the text stream consists of an escape signal
followed by the bookmark content, e.g., \!M {bookmark
content}. The bookmark content carries a 16-bit integer time
stamp ETS and additional information. The same ETS is
added to the corresponding FAP stream to enable synchronization.
The class of Facial Animation Parameters is
extended to carry the optional ETS.
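
The following sketch illustrates how such bookmarks might be separated from the text before speech synthesis. The text specifies only an escape signal followed by content carrying a 16-bit ETS and additional information; the concrete layout assumed here (literal braces, a decimal ETS, then free-form additional information) and the function name `split_bookmarks` are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed concrete form: \!M{<decimal ETS> <additional information>}
BOOKMARK = re.compile(r"\\!M\s*\{\s*(\d+)([^}]*)\}")


def split_bookmarks(text: str) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Return the plain text and a list of (char_offset, ets, extra) bookmarks."""
    plain: List[str] = []
    marks: List[Tuple[int, int, str]] = []
    pos = 0       # read position in the input string
    out_len = 0   # length of the plain text emitted so far
    for m in BOOKMARK.finditer(text):
        chunk = text[pos:m.start()]
        plain.append(chunk)
        out_len += len(chunk)
        ets = int(m.group(1))
        if not 0 <= ets <= 0xFFFF:
            raise ValueError(f"ETS {ets} does not fit in 16 bits")
        # The offset records where in the spoken text the bookmark falls.
        marks.append((out_len, ets, m.group(2).strip()))
        pos = m.end()
    plain.append(text[pos:])
    return "".join(plain), marks


plain, marks = split_bookmarks(r"Hello \!M{17 smile 0.8}world, \!M{18}goodbye.")
print(plain)
print(marks)
```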

If an absolute clock reference (PCR) is provided, a drift
compensation scheme can be implemented. Please note,
there is no master slave notion between the FAP stream and
the text. This is because the decoder might decide to vary the
speed of the text, and a variation of the facial animation
might also become necessary, for example, if an avatar needs to
react to visual events occurring in its environment, such as
when Avatar A is talking to a user and Avatar B enters the
room. In this case, a natural reaction of Avatar A is to look
at Avatar B, smile and, while doing so, slow down the speed
of the spoken text.






Autonomous Animation Driven Mostly by Text 

In the case of facial animation driven by text, the additional
animation of the face is mostly restricted to events that
do not have to be animated at a rate of 30 frames per second.
In particular, high-level action units such as a smile should be
defined at a much lower rate. Furthermore, the decoder can
do the interpolation between different action units without
tight control from the receiver.

The present invention includes action units to be animated 
and their intensity in the additional information of the 
bookmarks. The decoder is required to interpolate between 
the action units and their intensities between consecutive 
bookmarks. This provides the advantages of authoring ani- 
mations using simple tools, such as text editors, and signifi- 
cant savings in bandwidth. 
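
As an illustration of this interpolation, the sketch below computes the intensity of one action unit at an arbitrary frame time from the values carried by consecutive bookmarks. Linear interpolation is an assumption made here; the text does not prescribe an interpolation method. The name `intensity_at` and the keyframe representation are likewise hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Keyframe = Tuple[float, float]  # (time in seconds, action-unit intensity)


def intensity_at(keys: List[Keyframe], t: float) -> float:
    """Linearly interpolate the intensity at time t; keys are sorted by time."""
    if not keys:
        raise ValueError("at least one keyframe is required")
    times = [k[0] for k in keys]
    i = bisect_right(times, t)
    if i == 0:
        return keys[0][1]        # before the first bookmark: hold first value
    if i == len(keys):
        return keys[-1][1]       # after the last bookmark: hold last value
    (t0, v0), (t1, v1) = keys[i - 1], keys[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Two bookmarks define a smile that ramps from 0.0 to 1.0 over two seconds.
smile = [(0.0, 0.0), (2.0, 1.0)]
samples = [intensity_at(smile, t) for t in (0.0, 0.5, 1.0, 2.0, 3.0)]
print(samples)
```

Only the two bookmark values are transmitted; the decoder derives every intermediate frame locally, which is the source of the bandwidth saving noted above.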

FIG. 2 depicts the environment in which the present 
invention is to be used. The multimedia stream is created and 
coded in the encoder section 1. The encoded multimedia 
stream is then sent through a communication channel (or 
storage) to a remote destination. At the remote destination, 
the multimedia is recreated by the decoder 2. At this stage, 
the decoder 2 must synchronize one set of events (e.g.,
facial animations) with another set of data (e.g., the speech
of the avatar) using only information encoded with the
original multimedia streams.

FIG. 3 depicts the MPEG-4 architecture of the decoder, 
which has been modified to operate according to the present 
invention. The signal from the encoder 1 (not shown) enters 
the Demultiplexer (DMUX) 3 via the transmission channel 
(or storage, which can also be modeled as a channel). The 
DMUX 3 separates out the text and the video data, as well
as the control and auxiliary information. The FAP stream, 
which includes the Encoder Time Stamp (ETS), is also 
output by the DMUX 3 directly to the FA/AP/MP 4, which 
is coupled to the Text-to-Speech Converter (TTS) 5, a 
Phoneme FAP converter 6, a compositor 7 and a visual 
decoder 8. A Lip Shape Analyzer 9 is coupled to the visual 
decoder 8 and the TTS 5. User input enters via the com- 
positor 7 and is output to the TTS 5 and the FA/AP/MP 4. 
These events include start, stop, etc. 

The TTS 5 reads the bookmarks, and outputs the phonemes
along with the ETS as well as with a Real-time Time
Stamp (RTS) to the Phoneme FAP Converter 6. The pho- 
nemes are used to put the vertices of the wireframe in the 
correct places. At this point the image is not rendered. 

This data is then output to the visual decoder 8, which
renders the image, and outputs the image in video form to
the compositor 7. It is in this stage that the FAPs are aligned
with the phonemes: each phoneme, which carries an ETS/RTS
combination, is synchronized with the corresponding FAP
that carries the matching ETS.
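
The alignment step can be sketched as a lookup keyed by the ETS. The data layout below is an assumption for illustration: each phoneme is taken to carry the ETS of the bookmark that precedes it (or none) together with its RTS, and the FAPs are held in a table indexed by ETS. The names `Phoneme` and `align` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Phoneme(NamedTuple):
    symbol: str
    ets: Optional[int]   # ETS of the bookmark at this phoneme, if any
    rts: float           # real-time time stamp assigned by the TTS


def align(phonemes: List[Phoneme],
          faps: Dict[int, str]) -> List[Tuple[float, str, Optional[str]]]:
    """Pair each phoneme with the FAP whose ETS matches, timed by the RTS."""
    timeline = []
    for p in phonemes:
        # A phoneme without a bookmark, or with an unknown ETS, gets no FAP.
        fap = faps.get(p.ets) if p.ets is not None else None
        timeline.append((p.rts, p.symbol, fap))
    return timeline


faps = {17: "smile", 18: "raise_eyebrows"}
phonemes = [
    Phoneme("h", None, 0.00),
    Phoneme("e", 17, 0.08),
    Phoneme("l", None, 0.15),
    Phoneme("o", 18, 0.21),
]
timeline = align(phonemes, faps)
print(timeline)
```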

The text input to the MPEG-4 hybrid text-to-speech 
(TTS) converter 5 is output as coded speech to an audio 
decoder 10. In this system, the audio decoder 10 outputs 
speech to the compositor 7, which acts as the interface to the 
video display (not shown) and the speakers (not shown), as 
well as to the user. 

On the video side, video data output by the DMUX 3 is 
passed to the visual decoder 8, which creates the composite 
video signal based on the video data and the output from the 
FA/AP/MP 4. 

There are two different embodiments of the present inven- 
tion. In a first embodiment, the ETS placed in the text stream 
includes the facial animation. That is, the bookmark (e.g., 
escape sequence) is followed by a 16 bit codeword that 
represents the appropriate facial animation to be synchro- 
nized with the speech at this point in the animation. 

Alternatively, the ETS placed in the text stream acts as a 
pointer in time to a particular facial animation in the FAP 
stream. Specifically, the escape sequence is followed by a 16
bit code that uniquely identifies a particular place in the FAP
stream.

While this aspect (i.e., the local synchronization) of the
present invention has been described in terms of animation
data, the animation data could be replaced with natural audio
or video data, or rendering data for synthetic imaging. More
specifically, the above description provides a method and
system for aligning animation data with text-to-speech data.
However, the same method and system applies if the
text-to-speech data is replaced with audio or video or rendering
data. In fact, the alignment of the two data streams is
independent of the underlying data, particularly with regard
to the TTS stream. Furthermore, the present invention
applies to any two data streams in which the encoder does
not contain a priori knowledge of the exact timing of the
decoder.

What is claimed is:

1. A method for synchronizing multimedia streams comprising
the steps of:

a) using a transmitter-driven synchronization technique
that relies upon a plurality of time stamps inserted in
the multimedia streams during an encoding process;

b) using an internal inter-media synchronization technique
during a decoding process if a performance
measurement value of at least one of the plurality of
time stamps exceeds some predetermined threshold.

2. The method according to claim 1, further comprising
the step of:

c) extracting a coarsest inter-media synchronization and/
or structural information present in the multimedia
streams using the transmitter-driven technique and
inputting the coarsest inter-media synchronization and/
or structural information to a controller employing the
internal inter-media synchronization technique.

3. The method according to claim 1, further comprising
the step of using the internal clock of a receiver as an
absolute time reference.

4. The method according to claim 1, further comprising
the step of switching back to the transmitter-driven synchronization
technique whenever the performance measurement
value stabilizes to an acceptable level.

5. The method according to claim 1, further comprising
the step of providing predetermined hysteresis when switching
between one synchronization technique to the other to
avoid oscillations between the two synchronization techniques.

6. The method according to claim [illegible], further comprising
the step of:

d) including an encoder time stamp in a text string
transmitted to a text-to-speech converter and in a facial
animation parameter stream transmitted to a receiver
that is to combine converted speech with facial animations.

7. The method according to claim 6, further comprising
the steps of:

e) reading the encoder time stamp with a transmitter-driven
synchronization controller;

f) passing the encoder time stamp to an internal inter-media
synchronization controller;

g) generating a real-time time stamp for a facial animation
system with the internal inter-media synchronization
controller; and

h) associating a correct facial animation parameter with
the real-time time stamp using the encoder time stamp
as a reference.

8. The method according to claim 7, further comprising
the step of generating a decoder time stamp for a decoder.

9. The method according to claim 1, further comprising
the step of determining that the encoder time stamp contains
a maximum drift of approximately 80 milliseconds.

10. The method according to claim [illegible], wherein the
performance measurement value comprises a drift of at least
one time stamp.

11. The method according to claim 10, wherein the
predetermined threshold comprises approximately 80 milliseconds.

12. A system for synchronizing multimedia streams comprising:

a) a transmitter-driven synchronization controller having
a control input, synchronizing the multimedia streams
based on a plurality of time stamps in the multimedia
streams inserted at an encoding end, and extracting a
coarsest inter-media synchronization and/or structural
information present in the multimedia stream;

b) an internal inter-media synchronization controller having
a control input, being coupled to the transmitter-driven
synchronization controller and receiving the
coarsest inter-media synchronization and/or structural
information present in the multimedia streams; and

c) a processor receiving the plurality of time stamps,
being coupled to the control input of the transmitter-driven
synchronization controller and the control input
of the internal inter-media synchronization controller,
and activating the internal inter-media synchronization
controller if a performance measurement value exceeds
some predetermined threshold.

13. The system according to claim 12, further comprising
an internal clock being coupled to the processor, the internal
inter-media synchronization controller, and the transmitter-driven
synchronization controller, and acting as an absolute
time reference.

14. The system according to claim 12, wherein the performance
measurement value comprises a drift value of at
least one of the time stamps.

15. The system according to claim 14, wherein the processor
activates the transmitter-driven synchronization controller
whenever the drift value of the time stamp stabilizes
to acceptable values.

16. The system according to claim [illegible], wherein the
processor includes predetermined hysteresis to avoid oscillations
in switching between the internal inter-media synchronization
controller and the transmitter-driven
synchronization controller.

17. The system according to claim 12, wherein the coarsest
inter-media synchronization and/or structural information
comprises an encoder time stamp.

18. The system according to claim 12, wherein the coarsest
inter-media synchronization and/or structural information
comprises an encoder time stamp and an encoder real-time
time stamp.

19. The system according to claim 12, wherein the coarsest
inter-media synchronization and/or structural information
comprises a decoder time stamp and an encoder time
stamp.

20. The system according to claim 18, wherein the internal
inter-media synchronization controller associates a correct
facial animation parameter with the real-time time
stamp using the encoder time stamp as a reference.

21. The system according to claim 12, wherein the performance
measurement value comprises a drift of at least
one time stamp exceeding a predetermined threshold, and
the transmitter-driven synchronization controller measures
the drift of the encoder time stamp.

22. The system according to claim 21, wherein the pre-
determined threshold comprises approximately 80 millisec- 
onds. 

23. An apparatus for synchronizing a facial animation 
parameter stream and a text stream in an encoded animation 
comprising: 

a) a demultiplexer receiving the encoded animation, out- 
putting a text stream and a facial animation parameter 
stream, wherein said text stream includes a plurality of 
codes indicating a synchronization relationship with a 
plurality of mimics in the facial animation parameter 
stream and the text in the text stream; 

b) a transmitter-based synchronization controller being 
coupled to the demultiplexer, having a control input, 
controlling the synchronization of the facial animation 
parameter stream and the text stream based on the 
plurality of codes placed in the text stream during an 
encoding process, and outputting the plurality of codes; 

c) a local synchronization controller being coupled to the 
transmitter-based synchronization controller, and 
including: 

(i) a control input; 

(ii) a text-to-speech converter coupled to the 
demultiplexer, converting the text stream to speech, 
outputting a plurality of phonemes, and outputting a 
plurality of real-time time stamps and the plurality of 
codes in a one-to-one correspondence, whereby the 
plurality of real-time time stamps and the plurality of 
codes indicate a synchronization relationship 
between the plurality of mimics and the plurality of 
phonemes; and 

(iii) a phoneme to video converter being coupled to the 
text-to-speech converter, synchronizing a plurality of 
facial mimics with the plurality of phonemes based 
on the plurality of real-time time stamps and the 
plurality of codes; and 

c) a switch being coupled to the transmitter-based syn- 
chronization controller and the local synchronization 
controller and switching between the transmitter-based 
synchronization controller and the local synchroniza- 
tion controller based on a predetermined performance 
measurement of the plurality of codes. 

24. The apparatus according to claim 23, further com- 
prising a compositor converting the speech and video to a 
composite video signal. 

25. The apparatus according to claim 23, wherein the
phoneme to video converter includes: 

a) a facial animator creating a wireframe image based on 
the synchronized plurality of phonemes and the plural- 
ity of facial mimics; and 

b) a visual decoder being coupled to the demultiplexer 
and the facial animator, and rendering a video image 
based on the wireframe image. 

26. The apparatus according to claim 23, wherein the 
predetermined performance measurement comprises a drift 
value of one of the time stamps exceeding a predetermined 
threshold. 

27. The apparatus according to claim 26, wherein the 
predetermined threshold comprises approximately 80 milli- 
seconds. 

28. The apparatus according to claim 26, wherein the 
switch activates the transmitter-driven synchronization con- 
troller whenever the drift value stabilizes to an acceptable 
level. 

29. A synchronization controller for synchronizing mul- 
timedia streams without a priori knowledge of an exact 
timing of a decoder in an encoder comprising: 

a) an encoder based synchronization mode; 

b) a switching synchronization mode, which uses an 
encoder based synchronization technique and switches 
to a decoder based synchronization technique when- 
ever the encoder based synchronization technique 
becomes unreliable; and 

c) a cooperating synchronization mode, which uses an 
encoder based synchronization technique to provide a 
first level of synchronization, and uses a decoder based 
synchronization technique to provide a second level of 
synchronization. 

30. The controller according to claim 29, wherein the 
multimedia streams include text-to-speech data. 

31. The controller according to claim 29, wherein the 
multimedia streams include facial animation data. 

32. The controller according to claim 29, wherein the 
multimedia streams include rendering data. 

33. The controller according to claim 29, wherein the 
multimedia streams include audio data. 

34. The controller according to claim 29, wherein the 
multimedia streams include video data. 